


Using Text Data Processing 


LESSON OVERVIEW 
In this lesson, you will learn how to use text data processing. 


LESSON OBJECTIVES 
After completing this lesson, you will be able to: 


• Use the entity extraction transform 


Text Data Processing 
Facts about the explosion of data during the information age include the following: 


• In 2003, there were five exabytes of data in the world, double the amount of data from 
2000. (UC Berkeley) 


• The amount of information created, stored, and replicated worldwide has grown tenfold in 
five years. (IDC 2008) 


• 95% of digital data is unstructured. (IDC 2007) 

Details about the percentage of unstructured digital data: 

• 70 to 95% of stored data is unstructured. (Butler Group) 

• 80% of business is conducted using unstructured data. (Gartner Group) 

• The amount of unstructured data doubles every three months. (Gartner Group) 

• Seven million new Web pages are created every day. (Gartner Group) 


To deal with this explosion of the digital universe in size and complexity, IT organizations face 
three main imperatives. 


1. IT organizations need to transform their existing relationships with the business units. It 
will take all competent hands in an organization to deal with information creation, storage, 
management, security, retention, and disposal in an enterprise. Dealing with the digital 
universe is not a technical problem alone. 


2. IT teams need to spearhead the development of organization-wide policies for information 
governance: information security, information retention, data access, and compliance. 


3. IT teams need to rush new tools and standards into the organization, from storage 
optimization, unstructured data search, and database analytics to resource pooling 
(virtualization) and management and security tools. All will be required to make the 
information infrastructure as flexible, adaptable, and scalable as possible. 


The number of new open source platforms for unstructured data, such as Hadoop, 
underscores the increasing role of unstructured information in business. 